class: center, middle, inverse, title-slide

# Censored regression, selection model, weak IV, and quantile regression
## Tutorial 1
### Stanislav Avdeev

---
# Tutorials

- 7 TA sessions
  - 6 TA sessions are about lecture material
  - The last session is primarily about the exam and remaining questions about the course material (TBA)
- Send me **any** questions you want to discuss before each TA session
  - Use Canvas or send me an email (s.avdeev@tinbergen.nl)
  - Alternatively, leave your questions anonymously here: https://www.menti.com/c6uyd9qan4 (I will update the link each week on Canvas)

---
# Assignments

- Due date: 11:59pm on Sundays (the first assignment is an exception: 11:59am on Tuesday)
- Assignments are graded within a week of the deadline
- Solutions will not be shared, so if you want to discuss a specific exercise, let me know before the TA session (you submit your solutions on Sunday, so we can discuss any questions at the following TA session on Tuesday)

---
# Course objective

* The key objective of the course is **applying** microeconometric techniques rather than **deriving** econometric and statistical properties of estimators
* In other words, there's way less of this:

`\begin{align*}
\text{plim} \hat{\beta}_{OLS} = \beta + \text{plim} \left(\frac{1}{N}X'X\right)^{-1} \text{plim} \frac{1}{N} X' \varepsilon = \beta + Q^{-1} \text{plim} \frac{1}{N} X' \varepsilon
\end{align*}`

* And way more of this:

```r
library(tidyverse)
library(fixest)
tb <- tibble(groups = sort(rep(1:10, 600)),
             time = rep(sort(rep(1:6, 100)), 10),
             Treated = I(groups > 5) * I(time > 3),
             Y = groups + time + Treated * 5 + rnorm(6000))
m <- feols(Y ~ Treated | groups + time, data = tb)
```

If you would like to go deeper into the former, take Advanced Econometrics I and II next year

---
# Censored regression

- Censoring occurs when the value of a variable is limited due to some constraint
- For example, we tend not to see wages below the federal minimum wage
- In this case, OLS estimates are biased
- A standard method
to account for censoring is to combine a probit with OLS, i.e. the tobit model

---
# Censored regression: simulation

- The clearest way to understand how a certain estimator works is to generate data yourself, so you know the true **data generating process** (DGP)
- Let's estimate returns to education: does education increase wages?
- But suppose that we do not observe wages below a specific threshold (this can happen due to privacy concerns, coding, etc.)
- We need to generate data containing years of education and wages

```r
# Always set seed so you can replicate your results
set.seed(7)
df <- tibble(education = runif(1000, 5, 15),
             wage_star = 500 + 150*education + rnorm(1000, 0, 100),
             wage = ifelse(wage_star < 1500, 1500, wage_star)) %>%
  arrange(desc(wage_star))
```

---
# Censored regression: simulation

- Let's look at the head and tail of our dataset

```
## # A tibble: 6 × 3
##   education wage_star  wage
##       <dbl>     <dbl> <dbl>
## 1      14.9     2957. 2957.
## 2      14.8     2932. 2932.
## 3      14.9     2921. 2921.
## 4      14.4     2857. 2857.
## 5      14.4     2854. 2854.
## 6      13.7     2842. 2842.
```

```
## # A tibble: 6 × 3
##   education wage_star  wage
##       <dbl>     <dbl> <dbl>
## 1      5.51     1167.  1500
## 2      5.27     1150.  1500
## 3      5.48     1147.  1500
## 4      5.26     1127.  1500
## 5      5.04     1127.  1500
## 6      5.06     1084.
1500
```

---
# Censored regression: OLS

Now let's pretend that we do not know the DGP and simply apply OLS

```r
ols_model <- lm(wage ~ education, df)
```

<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Model 1 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 625.446*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (11.480) </td> </tr> <tr> <td style="text-align:left;"> education </td> <td style="text-align:center;"> 139.354*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> (1.097) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 1000 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>

- Using these OLS estimates, we would conclude that "an additional year of education is associated with a 139.354 increase in monthly wages"

---
# Censored regression: tobit model

- But these estimates are biased since we know the true effect is `\(150\)` (remember the DGP)
- Let's try to recover the unbiased effect of education on wages using a tobit model
- The solution provided by the tobit model is to
  - use a probit to account for the censoring
  - estimate OLS on the non-censored data
- The tobit estimator is easy to implement with the `censReg` package

---
# Censored regression: tobit model

Remember that we have left-censored wages: wages below `\(1500\)` are coded as `\(1500\)`

```r
library(censReg)
tobit_model <- censReg(wage ~ education, data = df, left = 1500)
```

<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th
style="text-align:left;"> </th> <th style="text-align:center;"> Model 1 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 470.582*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (14.184) </td> </tr> <tr> <td style="text-align:left;"> education </td> <td style="text-align:center;"> 152.273*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (1.299) </td> </tr> <tr> <td style="text-align:left;"> logSigma </td> <td style="text-align:center;"> 4.599*** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> (0.024) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 1000 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>

We recovered the unbiased estimates of returns to education

---
# Censored regression: graphically

We are going to use a lot of graphs since they provide more intuition about what is happening in the regression models

<!-- -->

---
# Censored regression: some remarks

- You can specify both left and right censoring with the `censReg` function
- An important assumption of the tobit model is that the unobserved term is normally distributed (which is the case in our simulated dataset)
- If the data are missing not because the outcome variable is **above (or below)** some threshold but because individuals in the data have made a **choice** such that we can't observe their outcome variable, we can't treat this as censoring
- A censored regression cannot be applied because the availability of data is influenced by the choice of agents (i.e.
selection on unobservables)
- This is a typical sample selection problem

---
# Sample selection model

- Let us consider the case of studying women's wages
- Usually, wages are observed for a fraction of women in the sample, while the remaining women are observed as unemployed or inactive
- If we run an OLS regression using the observed wages, it delivers consistent estimates only if working women are a random sample of the population
- However, the theory of labor supply suggests that this may not be the case, since (typically) female labor supply is sensitive to household decisions
- That is, female workers self-select into employment, and the self-selection is not random
- This may lead us to underestimate the gender wage gap

---
# Sample selection model

- Suppose a woman's decision to work is driven by a latent index `\(I_i^*\)` that depends on a set of observed `\(Z_i\)` and unobserved `\(V_i\)` characteristics

`\begin{align*}
I_i^* = Z_i ' \gamma + V_i
\end{align*}`

- This indicator function (the decision to work or not) takes two values

`\begin{align*}
I_i =
\begin{cases}
1 \text{ (working)} & \text{if } I_i^* > 0 \\
0 \text{ (not working)} & \text{if } I_i^* \leq 0
\end{cases}
\end{align*}`

- Suppose there is a latent outcome `\(Y_i^*\)`, i.e. the wage of a female worker, which depends on a set of observed `\(X_i\)` and unobserved `\(U_i\)` characteristics

`\begin{align*}
Y_i^* = X_i ' \beta + U_i
\end{align*}`

- However, we observe wages only for women who decided to work.
`\(Y_i\)` denotes observed wages, which equal

`\begin{align*}
Y_i =
\begin{cases}
Y_i^* & \text{if } I_i = 1 \\
\text{missing} & \text{if } I_i = 0
\end{cases}
\end{align*}`

---
# Sample selection model: simulation

```r
library(mvtnorm) # to simulate bivariate normal random variables
set.seed(7)
df <- tibble(z = runif(100),
             x = runif(100),
             uv = rmvnorm(100, mean = c(0, 0),
                          sigma = rbind(c(1, 0.5), c(0.5, 1))),
             i_star = 4 - 5 * z + uv[, 1],
             y_star = 6 - 3 * x + uv[, 2],
             y = ifelse(i_star > 0, y_star, 0))
head(df)
```

```
## # A tibble: 6 × 6
##        z     x uv[,1]   [,2] i_star y_star     y
##    <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
## 1 0.989  0.396  0.653  0.702 -0.291   5.51  0
## 2 0.398  0.486 -0.381 -1.15   1.63    3.39  3.39
## 3 0.116  0.497 -0.117 -1.47   3.30    3.04  3.04
## 4 0.0697 0.387  1.22   1.24   4.87    6.08  6.08
## 5 0.244  0.643  0.843  0.333  3.62    4.40  4.40
## 6 0.792  0.344 -0.293  0.333 -0.253   5.30  0
```

---
# Sample selection model: simulation

The true effect of `\(Z\)` on `\(I\)` (decision to work) is `\(-5\)` and the true effect of `\(X\)` on `\(Y\)` (wages) is `\(-3\)`

```r
selection_equation <- glm(I(y > 0) ~ z, df, family = binomial(link = "probit"))
wage_equation <- lm(y ~ x, df)
```

<table style="NAborder-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto;" class="table"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Model 1 </th> <th style="text-align:center;"> Model 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:center;"> 5.703*** </td> <td style="text-align:center;"> 4.836*** </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (1.192) </td> <td style="text-align:center;"> (0.498) </td> </tr> <tr> <td style="text-align:left;"> z </td> <td style="text-align:center;"> −7.479*** </td> <td style="text-align:center;"> </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;">
(1.565) </td> <td style="text-align:center;"> </td> </tr> <tr> <td style="text-align:left;"> x </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −2.499** </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> </td> <td style="text-align:center;box-shadow: 0px 1px"> (0.817) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. </td> <td style="text-align:center;"> 100 </td> <td style="text-align:center;"> 100 </td> </tr> </tbody> <tfoot><tr><td style="padding: 0; " colspan="100%"> <sup></sup> + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001</td></tr></tfoot> </table>

- Clearly, the estimates are biased since `\(cov(U_i, V_i) \neq 0\)`

---
# Sample selection model: assumptions

To estimate the sample selection model, distributional assumptions on the disturbance terms are made, such as bivariate normality

`\begin{align*}
\left[\begin{array}{l}
U_{i} \\
V_{i}
\end{array}\right] \sim \mathcal{N}\left(0,\left[\begin{array}{cc}
\sigma^{2} & \rho \sigma \\
\rho \sigma & 1
\end{array}\right]\right)
\end{align*}`

- Note that the variance of the normal distribution is not identified in the probit model, so it is set to 1
- To solve the sample selection problem, one needs to use the Heckman selection model (left as an exercise in the first assignment)
- The Heckman estimator is very similar to the tobit estimator
- The difference is that the Heckman estimator allows for a set of characteristics that determine whether or not the outcome variable is censored

---
# Weak instrument problem

- If `\(Z\)` has only a trivial effect on `\(X\)`, then it's not *relevant* - even if it's truly exogenous, it does not matter because there's no variation in `\(X\)` we can isolate with it
- And our small-sample bias will be big
- Thus, the weak instrument problem means that we probably shouldn't be using IV in small samples
- This also means that it's really important that `\(cov(X, Z)\)` is not small
- There are
some rules of thumb for how strong an instrument must be to count as "not weak"
  - A t-statistic above 3, or an F-statistic from a joint test of the instruments that is 10 or above
- These rules of thumb aren't great - selecting a model on the basis of significance naturally biases your results
- What you really want is to know the *population* effect of `\(Z\)` on `\(X\)` - you want the F-statistic from *that* to be 10+. Of course, we don't actually know that

---
# Weak instrument problem: simulation

- Let's look at the output of `feols()` using a simulated dataset

```r
set.seed(777)
df <- tibble(Y = rpois(200, lambda = 4),
             Z = rbinom(200, 1, prob = 0.4),
             X1 = Z * rbinom(200, 1, prob = 0.8),
             X2 = rnorm(200),
             G = sample(letters[1:4], 200, replace = TRUE))
```

---
# Weak instrument problem: simulation

```r
iv <- feols(Y ~ X2 | X1 ~ Z, data = df, se = 'hetero')
thef <- fitstat(iv, 'ivf', verbose = FALSE)$`ivf1::X1`$stat
iv
```

```
## TSLS estimation, Dep. Var.: Y, Endo.: X1, Instr.: Z
## Second stage: Dep. Var.: Y
## Observations: 200
## Standard-errors: Heteroskedasticity-robust
##             Estimate Std. Error  t value  Pr(>|t|)
## (Intercept) 3.616743   0.164506 21.98548 < 2.2e-16 ***
## fit_X1      1.066064   0.369839  2.88251 0.0043831 **
## X2          0.326317   0.153746  2.12244 0.0350497 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 1.93622   Adj. R2: 0.009568
## F-test (1st stage), X1: stat = 522.5, p < 2.2e-16, on 1 and 197 DoF.
## Wu-Hausman: stat = 11.1, p = 0.001029, on 1 and 196 DoF.
```

- 522.47 is way above 10! We're probably fine in this particular regression

---
# Overidentification tests

- "Overidentification" just means we have more identifying conditions (validity assumptions) than we actually need.
We only need one instrument, but we have two (or more)
- So we can compare what we get using each instrument individually
- If we assume that *at least one of them is valid*, and they both produce similar results, then that's evidence that *both* are valid

---
# Overidentification tests: simulation

- We can do this using the Sargan test, available via `fitstat()` in `fixest`

```r
set.seed(7)
# Create data where Z1 is valid and Z2 is invalid
df <- tibble(Z1 = rnorm(1000),
             Z2 = rnorm(1000),
             X = Z1 + Z2 + rnorm(1000),
             # True effect is 1
             Y = X + Z2 + rnorm(1000))
iv <- feols(Y ~ 1 | X ~ Z1 + Z2, data = df, se = 'hetero')
fitstat(iv, 'sargan')
```

```
## Sargan: stat = 248.7, p < 2.2e-16, on 1 DoF.
```

- That's a small p-value. We can reject that the results are similar for each IV, telling us that at least one instrument is endogenous (although without seeing the actual data generating process, we couldn't tell whether it is `\(Z1\)` or `\(Z2\)`)

---
# Overidentification tests: simulation

- How different are they? What did the test see that it was comparing?

```r
iv1 <- feols(Y ~ 1 | X ~ Z1, data = df)
iv2 <- feols(Y ~ 1 | X ~ Z2, data = df)
export_summs(iv1, iv2, statistics = c(N = 'nobs'))
```
|             | Model 1  | Model 2  |
|:------------|:--------:|:--------:|
| (Intercept) | 0.03     | 0.01     |
|             | (0.04)   | (0.05)   |
| fit_X       | 1.06 *** | 1.91 *** |
|             | (0.04)   | (0.05)   |
| N           | 1000     | 1000     |

*** p < 0.001; ** p < 0.01; * p < 0.05.
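The comparison behind the Sargan test can also be seen by hand: with a single instrument, the IV estimate is just a ratio of sample covariances (the Wald estimator). A minimal sketch, assuming the same simulated `df` as above:

```r
# Just-identified IV estimate as a ratio of sample covariances:
# beta_IV = cov(Z, Y) / cov(Z, X)
beta_z1 <- cov(df$Z1, df$Y) / cov(df$Z1, df$X) # valid instrument: close to the true effect of 1
beta_z2 <- cov(df$Z2, df$Y) / cov(df$Z2, df$X) # invalid instrument: also picks up the direct effect of Z2 on Y
c(beta_z1, beta_z2)
```

Because `\(Z2\)` enters `\(Y\)` directly, its Wald ratio converges to 2 rather than 1, which is exactly the discrepancy the Sargan statistic detects.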
Notice that the first model gives an accurate coefficient of 1

---
# Quantile regression

- Consider a very simple OLS model using the experimental data:

`\begin{align*}
Y_i = \alpha + D_i ' \beta + U_i
\end{align*}`

where `\(Y_i\)` is an outcome variable and `\(D_i\)` is a treatment variable
- Recall that this will estimate our ATE for the treatment
- What is the interpretation of this effect?
  - `\(E(Y_i (1)) - E(Y_i (0))\)`, i.e. the expected change in the outcome for a person moving from untreated to treated. That's a useful metric
- In other words, it characterizes features of the **mean** of our outcome variable, conditional on covariates

---
# Quantile regression

- What if we care about things other than the mean?
- Quantile regression also solves the problems with
  - skewed variables - no more worrying about logs or outliers in the outcome variable
  - censoring - in many datasets, our outcome variables are top-coded or bottom-coded
- But it has its own issues
  - it is noisier
  - it is challenging to interpret in an intuitive way
- If you have an underlying theory that has implications for the distribution, quantile regression is the right tool for empirical analysis

---
# Quantile regression: simulation

```r
set.seed(7)
df <- tibble(x = seq(0, 100, length.out = 100),
             # non-constant variance
             sig = 0.1 + 0.05 * x,
             y = 6 + 0.1 * x + rnorm(100, mean = 0, sd = sig))
```

---
# Quantile regression: simulation

Let's simulate the dataset with a normal random error with non-constant variance

<!-- -->

- We can see the increasing variability: as `\(X\)` gets bigger, `\(Y\)` becomes more variable

---
# Quantile regression: simulation

- The estimated mean conditional on `\(X\)` is still unbiased, but it doesn't tell us much about the relationship between `\(X\)` and `\(Y\)`, especially as `\(X\)` gets larger
- To perform quantile regression, use the `quantreg` package and specify `\(\text{tau}\)` - the quantile you are interested in

```r
library(quantreg)
qr <- rq(y ~ x,
df, tau = 0.9)
```

```
## 
## Call: rq(formula = y ~ x, tau = 0.9, data = df)
## 
## tau: [1] 0.9
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 6.77933      6.34561  7.17300
## x           0.14786      0.12911  0.16153
```

- The `\(X\)` coefficient estimate of 0.148 says "a one-unit increase in `\(X\)` is associated with a 0.148 increase in the `\(0.9\)` quantile of `\(Y\)`"
- The "lower bd" and "upper bd" values are confidence intervals calculated using the "rank" method (to read more about calculating confidence intervals, use `?summary.rq`)

---
# Quantile regression: simulation

```r
qr2 <- rq(y ~ x, data = df, tau = seq(.1, .9, by = .1))
summary.rq(qr2)
```

```
## 
## Call: rq(formula = y ~ x, tau = seq(0.1, 0.9, by = 0.1), data = df)
## 
## tau: [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
## 
## Coefficients:
##             coefficients lower bd upper bd
## (Intercept) 5.95383      5.14707  6.06363
## x           0.04302      0.02601  0.05842
```

- The intercept estimate doesn't change much, but the slopes steadily increase across quantiles

---
# Quantile regression: simulation

- Let's plot our quantile estimates

<!-- -->

---
# Quantile regression: simulation

<!-- -->

- Each black dot is the slope coefficient for the quantile indicated on the x-axis. The red lines are the least squares estimate and its confidence interval
- You can see how the lower and upper quantiles are well beyond the least squares estimate

---
# Quantile regression: inference

- There are several alternative methods of conducting inference about quantile regression coefficients
  - rank-inversion confidence intervals: `summary.rq(qr)`
  - more conventional standard errors: `summary.rq(qr, se = "nid")`
  - bootstrapped standard errors: `summary.rq(qr, se = "boot")`

---
# References

- Huntington-Klein, N. The Effect: An Introduction to Research Design and Causality, Chapter 19, https://theeffectbook.net
- Huntington-Klein, N. Econometrics Course Slides, Week 8, https://github.com/stnavdeev/EconometricsSlides
- Cunningham, S.
Causal Inference: The Mixtape, Chapter 7, https://mixtape.scunning.com/instrumental-variables.html
- Adams, C. Learning Microeconometrics with R, Chapter 6